The primary objective of this work was to identify a sample of participants from our list of volunteers that is broadly representative of the country as a whole.
The overall process contains several steps:

* Define Sampling Targets: We have a target file that defines strata based on the combined characteristics of Sex, Age group, Ethnicity, Race, State, and Rural/Urban status (based on county). We take each stratum's contribution to the US population and apply it to a target sample of 10,000.
* Update Sampling Targets: We subtract participants who have already been enrolled from the original targets (described below).
* Map Volunteer Roster to Targets: We classify our roster of volunteers (i.e., people who emailed us and then filled out a preliminary Qualtrics survey) into categories that align with our targets. We then take the sum of volunteers available in each of the 9,936 sub-classifications that exist in our target file and match them to the target file. From here, we define the probability of being sampled as the target number divided by the number of volunteers available to us in that stratum.
When we ran this analysis on the number of people currently available, two concerns arose: (1) a number of the sub-classifications have a target value of less than one, so sampling even one person would over-sample that specific group; and (2) in other sub-classifications, our available pool falls short of the desired target number. Thus, after taking this strict matching on all characteristics of interest into account, we increase sampling weights on the margins of categories where the sample size is deficient.
After applying these sampling weight adjustments, we take 10,000 random samples of size S (our desired sample size for that day's list) and determine which sample most closely aligns with the overall demographic characteristics of our desired targets, using a sum of squared differences approach.
Sampling target proportions were defined for each combination of the following characteristics:

1. urbanicity (u), 3 values [completely rural, mostly rural, mostly urban]
2. state (v), 51 values [50 States + District of Columbia]
3. age group (w), 3 values [18-44, 45-69, 70+]
4. sex (x), 2 values [Male, Female]
5. ethnicity (y), 2 values [Hispanic, Non-Hispanic]
6. race (z), 6 values [White, Black, Native American, Asian, Pacific Islander, More than one Race]
We defined target sample sizes, based on each sub-population’s relative contribution to the overall US population (p_uvwxyz) and applying that to our overall study sample size of 10,000:
T_uvwxyz = p_uvwxyz * 10000
While many of these proportions are very small, we keep all categories in our targets so that we can measure our observed sample against the “ideal” sample (even if that ideal sample contains fractions of people).
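The construction of the target file itself happens upstream of this report; a minimal sketch of the calculation, assuming a data frame targets with one row per stratum and a column pop holding that stratum's census population (both appear in the output below), is:

# SKETCH ONLY: EACH STRATUM'S TARGET IS ITS SHARE OF THE TOTAL US POPULATION
# APPLIED TO THE OVERALL STUDY SAMPLE SIZE OF 10,000 (THE COLUMN NAME R10.000
# MATCHES THE TARGET COLUMN SHOWN IN THE OUTPUT BELOW)
targets$R10.000 <- targets$pop / sum(targets$pop) * 10000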
This dataset has one row for each unique, observed combination of the above characteristics (9,936 of the 11,016 possible) and one column for each of the six characteristic categories described above, as well as the target sample size. A sample of the first few rows of this dataset is provided below:
# NEXT, LET'S AGGREGATE TARGET NUMBERS UP TO URBAN/RURAL (FROM COUNTY)
targets <- targets %>%
group_by(region, STNAME, AGEGRP, sex, hispanic, race, urban.rural) %>%
summarise(pop = sum(pop), R10.000 = sum(R10.000)) %>%
as.data.frame()
head(targets)
## region STNAME AGEGRP sex hispanic race urban.rural pop R10.000
## 1 1 Connecticut 1 F H AA mostly urban 930.8 0.03670005
## 2 1 Connecticut 1 F H BA mostly urban 15331.4 0.60449408
## 3 1 Connecticut 1 F H IA mostly urban 2662.8 0.10499021
## 4 1 Connecticut 1 F H PA mostly urban 508.4 0.02004545
## 5 1 Connecticut 1 F H TOM mostly urban 4707.4 0.18560572
## 6 1 Connecticut 1 F H WA mostly urban 99421.2 3.92002864
If a number of participants have already been enrolled, we need to adjust our sampling targets to account for them. So first, we compile the demographic characteristics of the enrolled sample so that they can be mapped 1:1 back to our target sample data, and the number enrolled in each category is subtracted from the target sample size.
A sample of the first few rows of this updated data set is shown below:
From here, we update our target sample by subtracting the number who consented (tracked in REDCap) from the original target number, giving us the ‘New Target’. After doing this, we can check that we have enrolled 3,683 out of our 10,000 and have 6,317 left to enroll in the study.
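A minimal sketch of this update, assuming a hypothetical stratum-level summary of the REDCap export called enrolled (with a count column n), is:

# SKETCH ONLY: JOIN CONSENTED COUNTS ONTO THE TARGETS AND SUBTRACT THEM
# ('enrolled' IS A HYPOTHETICAL SUMMARY OF THE REDCAP EXPORT, WITH COLUMN n)
targets <- targets %>%
  left_join(enrolled,
            by = c("region", "STNAME", "AGEGRP", "sex",
                   "hispanic", "race", "urban.rural")) %>%
  mutate(n = ifelse(is.na(n), 0, n),
         newtarget = R10.000 - n)
sum(targets$n)          # NUMBER ENROLLED SO FAR
sum(targets$newtarget)  # NUMBER LEFT TO ENROLL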
At this step, we KEEP negative targets, because most of the sub-classifications have targets < 1, so even getting one person into the group will make the target go negative; however, we are not focused on the specific categories and instead look at the overall proportions. In other words, we need an overall even distribution of males and females, but they do not have to (and often cannot) be exactly even within each county, age group, and race.
Since volunteers were asked to submit demographic information through an online survey, we are able to map each volunteer’s characteristics back to their target sample size and assign each person within the demographic sub-category an equal probability of being selected. In other words, if we needed 10 Asian Non-Hispanic women, aged 18-44, living in a mostly urban area of California, and there were 100 volunteers with these reported characteristics in our pool, each of them would have a 10% chance of being selected for this category. [Note: Here, we need only calculate the probabilities within each stratum, as the sampling function will account for each stratum’s contribution to the overall sample.]
To implement this, we first remove anyone who was already approached or who has consented to the study from our volunteer roster. Then, we calculate the total number eligible within each stratum used in the targets data.
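A minimal sketch of this step, with hypothetical object names (roster, contacted, pooln); fnames is the vector of stratum variable names used in the code further below:

# SKETCH ONLY: DROP VOLUNTEERS ALREADY APPROACHED/CONSENTED, THEN COUNT THE
# REMAINING POOL WITHIN EACH STRATUM
eligible <- roster %>% anti_join(contacted, by = "Id")
pooln <- eligible %>%
  group_by(across(all_of(fnames))) %>%
  summarise(pool = n(), .groups = "drop")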
After that, we should have two data sets that can be merged 1:1 and give us the following:
We use this dataset for two purposes: (1) to determine which categories have insufficient sample size (using our strict matching) and should be up-weighted; and (2) to estimate the sampling probabilities for each person in our pool of volunteers.
We first use this information to identify the demographic classifications where we are deficient, so that we can look at these numbers overall and assess how well (or poorly) we are doing on each of the six categories defined in our objectives. To do this, we sum the new target number needed for each demographic characteristic and the number available (under our strict matching criteria), and then calculate the shortfall as the difference between the new target (newtarget) and the number available in that specific category (pool).
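A minimal sketch of this marginal check, using a hypothetical stratum-level merge of the updated targets and pool counts (stratdat) and summing over age group (the same pattern applies to the other five characteristics):

# SKETCH ONLY: AGGREGATE TARGET VS. AVAILABLE POOL OVER ONE MARGIN (AGE GROUP)
# AND COMPUTE THE SHORTFALL; NEGATIVE SHORTFALLS ARE SET TO ZERO
margin.age <- stratdat %>%
  group_by(AGEGRP) %>%
  summarise(newtarget = sum(newtarget),
            pool      = sum(pool),
            .groups   = "drop") %>%
  mutate(shortfall = pmax(newtarget - pool, 0))
margin.age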
Or, to look at just non-region 3:
If we appear to be deficient in certain categories, then we can simply increase the probability weights on people with these characteristics. In other words, if we are deficient in people aged 70+, we can increase the sampling probability for anyone in this age group by 10-50%. This adjustment range seems to be sufficient to get us relatively close to the desired targets (for now) but we could consider adjusting them later.
For each of our 9,936 sub-classifications, then, we can define a sampling probability as the updated target number (newtarget) within that sub-classification divided by the pool of volunteers available in the corresponding sub-classification.
# JOIN EACH VOLUNTEER TO THEIR STRATUM'S TARGET AND POOL INFORMATION, THEN
# DEFINE EACH PERSON'S SELECTION PROBABILITY AS newtarget / pool
sdat <- edat %>% inner_join(pooln3, by = fnames)
sdat$prob <- sdat$newtarget / sdat$pool
head(sdat)
## X Id Hispanic__c AGEGRP sex hispanic race STNAME
## 1 2 0031U00001K5ErYQAV FALSE 2 F NH WA Washington
## 2 3 0031U00001K5GK3QAN FALSE 2 F NH WA Pennsylvania
## 3 4 0031U00001K5I8PQAV FALSE 1 M NH WA Colorado
## 4 5 0031U00001K5IXFQA3 FALSE 2 F NH WA Florida
## 5 6 0031U00001K5IiPQAV FALSE 2 F NH WA Wisconsin
## 6 7 0031U00001K5JOMQA3 FALSE 3 M NH WA Colorado
## urban.rural pool region pop R10.000 n newtarget shortfall
## 1 mostly urban 1859 6 833158.0 32.850169 7 25.850169 0
## 2 mostly urban 3306 1 1533008.0 60.444204 13 47.444204 0
## 3 mostly urban 689 5 694018.8 27.364119 6 21.364119 0
## 4 mostly urban 2820 4 2032876.0 80.153248 23 57.153248 0
## 5 mostly rural 141 2 180621.0 7.121615 3 4.121615 0
## 6 mostly urban 134 5 175886.0 6.934921 3 3.934921 0
## prob
## 1 0.01390542
## 2 0.01435094
## 3 0.03100743
## 4 0.02026711
## 5 0.02923131
## 6 0.02936508
Some specific sub-classifications may now have a negative target because the initial target was less than one and we recruited one person. Since we have technically exceeded the target needed in these sub-groups, we can remove volunteers in these sub-classifications from the sampling pool, as sketched below.
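A minimal sketch of that exclusion, applied to the merged volunteer-level data (sdat):

# SKETCH ONLY: DROP VOLUNTEERS WHOSE SUB-CLASSIFICATION HAS ALREADY MET OR
# EXCEEDED ITS TARGET
sdat <- sdat[sdat$newtarget > 0, ]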
We then increased probability weights, based on marginal demographics, for categories where the sample size was insufficient by more than 10%. Since there are potential compounding effects of up-weighting people who fill multiple deficient categories (e.g., if we lacked older African-American women, we may end up increasing the weights three times: once for age, once for race, and once for sex), the adjustments were ad hoc, based on the judgment of the investigators.
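The exact multipliers were chosen by hand; the sketch below only illustrates the form of the adjustment (the categories and factors shown are assumptions, not the values actually used):

# SKETCH ONLY: AD-HOC MARGINAL UP-WEIGHTING OF DEFICIENT GROUPS; THE
# CATEGORIES AND MULTIPLIERS BELOW ARE ILLUSTRATIVE ASSUMPTIONS
sdat$prob[sdat$AGEGRP == 3]  <- sdat$prob[sdat$AGEGRP == 3]  * 1.25  # E.G., OLDEST AGE GROUP
sdat$prob[sdat$race == "BA"] <- sdat$prob[sdat$race == "BA"] * 1.10  # E.G., BLACK VOLUNTEERS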
Next, we will run multiple samples of the population using the built-in sample function in R, which allows us to specify a desired sample size and a sampling probability for each row (volunteer) in our table. For exploratory purposes, I’ve run 1,000 draws, comparing two approaches:
We used separate calculations to determine the number of people who should be called per day for UAB and Pitt (s1 = s1a + s1b) or for NIAID (s2). We then take a sample of size s from our volunteer pool, where each individual’s probability of being selected was defined in the previous section. We repeat this sampling 10,000 times so that we have 10,000 sets of samples of size s.
First, we set up our objective proportions for each demographic category of interest:
# Because lists were generated independently for Region 3 vs. all other regions, we define the objective proportions for each separately
targetlist <- data.frame()
for (i in fnames) {
targetlist <- rbind(targetlist,
aggregate(targets$newtarget[targets$region!=3],
by = list(targets[targets$region!=3, i]),FUN = sum)
)
}
datsummary1 <- as.data.frame(t(targetlist$x)/sum(targets$newtarget[targets$region!=3]))
names(datsummary1) <- targetlist$Group.1
datsummary1
## Alabama Alaska Arizona Arkansas California Colorado
## 1 0.01634086 0.002293835 0.02740116 0.01195849 0.145511 0.02183972
## Connecticut Florida Hawaii Idaho Illinois Indiana
## 1 0.01133317 0.08089536 0.005892304 0.005634082 0.04504671 0.02475471
## Iowa Kansas Louisiana Maine Massachusetts Michigan
## 1 0.01236783 0.009604085 0.01792928 0.004072198 0.02534486 0.03787343
## Minnesota Mississippi Missouri Montana Nebraska Nevada
## 1 0.02155056 0.01279446 0.02338686 0.003889198 0.006812504 0.01117264
## New Hampshire New Jersey New Mexico New York North Dakota Ohio
## 1 0.004848394 0.03127979 0.008125055 0.0727431 0.002821704 0.04356165
## Oklahoma Oregon Pennsylvania Rhode Island South Dakota Texas
## 1 0.01474411 0.01457041 0.04538289 0.003808226 0.003594947 0.106305
## Utah Vermont Washington Wisconsin Wyoming completely rural
## 1 0.01069804 0.001952831 0.02666439 0.0213345 0.00186569 0.01353228
## mostly rural mostly urban 1 2 3 F M
## 1 0.1010119 0.8854558 0.4691001 0.3920358 0.1388641 0.505003 0.494997
## H NH AA BA IA PA TOM
## 1 0.2010186 0.7989814 0.06428357 0.1095954 0.0134142 0.002541797 0.0097808
## WA
## 1 0.8003842
targetlist <- data.frame()
for (i in fnames) {
targetlist <-
rbind(targetlist,
aggregate(targets$newtarget[targets$region==3],
by = list(targets[targets$region==3, i]),
FUN = sum)
)
}
datsummary3 <- as.data.frame(
t(targetlist$x)/sum(targets$newtarget[targets$region==3]))
names(datsummary3) <- targetlist$Group.1
datsummary3
## Delaware District of Columbia Georgia Kentucky Maryland
## 1 -0.02119952 -0.05047381 0.3232224 0.1324116 -0.1924382
## North Carolina South Carolina Tennessee Virginia West Virginia
## 1 0.2981709 0.1412649 0.2178991 0.1060246 0.04511804
## completely rural mostly rural mostly urban 1 2 3
## 1 0.01765557 0.3018162 0.6805282 0.4608586 0.4181942 0.1209472
## F M H NH AA BA IA
## 1 0.4996712 0.5003288 -0.05881416 1.058814 0.006907971 0.3998407 0.005954097
## PA TOM WA
## 1 0.002308056 -0.03747973 0.6224689
Next, we take a specified number of draws of a random sample from region 3 and all other regions:
# THIS USES MULTIPLE CORES, BASED ON AVAILABILITY ON THE MACHINE, TO COMPLETE THE SAMPLING PROCESS EFFICIENTLY
(numCores <- detectCores())
## [1] 8
registerDoParallel(numCores) # use max available
# SET NUMBER OF REPLICATE SAMPLES
n <- 30 # 20000
s1 <- 15*20 + 8*40 # desired sample size
gc(); print(Sys.time())
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1629245 87.1 3034832 162.1 2423381 129.5
## Vcells 12679118 96.8 21840629 166.7 18133344 138.4
## [1] "2021-04-02 08:02:08 EDT"
# SET SEED, IF DESIRED:
set.seed(1001)
r3sample <- list()
samples <- list()
rdat3 <- rdat %>% select(c("Id", all_of(fnames), "region")) %>% filter(region != 3)
dat1 <- sdat[sdat$region!=3, ]
rows <- foreach (i=1:n) %dopar% { sample(x = dat1$Id, size = s1, replace = FALSE, prob = dat1$prob) }
# GARBAGE COLLECTOR / PRINT SYS TIME TO MONITOR PROCESSING TIME AND CLEAN-UP MEMORY
# gc(); print(Sys.time())
samples <- foreach(i=1:n) %dopar% { sdat[sdat$Id %in% rows[[i]], c("Id", fnames, "region")] }
# print(Sys.time()); gc()
f <- length(fnames)
rdat3 <- rdat %>% filter(region!=3) %>% select(c("Id", all_of(fnames), "region")) %>% as.data.frame()
# gc(); print(Sys.time())
l1 <- l3 <- c()
for (i in 1:n) {
k1 <- samples[[i]][, c("Id", fnames, "region")]
k2 <- rbind(k1, rdat3)
l3 <- foreach (j=1:f, .combine=c) %do% {
as.vector(prop.table(table(k2[, fnames[j] ] ) ) )
}
datsummary3 <- rbind(datsummary3, l3)
l1 <- foreach (j=1:f, .combine=c) %do% {
as.vector(prop.table(table(k1[, fnames[j] ] ) ) )
}
datsummary1 <- rbind(datsummary1, l1)
}
# gc(); print(Sys.time())
To select the “best” sample (out of the 10,000 available), we calculate the marginal proportions of our sample in each group and compare them to the desired breakdown according to our updated targets. In other words, we calculate the percent male/female in each of the 10,000 samples and compare that to the percent male/female according to our updated targets. We weight each characteristic equally, so that deviation from the desired proportion for a single state is not weighted the same as deviation in sex; instead, deviations across all states as a whole are weighted equally with deviations across races, sexes, or age groups.
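Concretely, this mirrors the weighting applied in the code below: for each characteristic c, we sum the squared deviations of the candidate sample’s proportions from the target proportions over the levels of c, then divide by the number of levels considered:

dist = sum_c [ sum_j (p_cj - t_cj)^2 ] / k_c

where p_cj is the candidate sample’s proportion in level j of characteristic c, t_cj is the corresponding target proportion, and k_c is the number of levels of c included in the comparison. The sample with the smallest dist is selected.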
# Currently set up to evaluate sample for non-region 3
# Commented lines evaluate Region 3 only
eval3 <- eval1 <- list()
eval1[[1]] <- eval3[[1]] <- 2:3
eval1[[2]] <- eval3[[2]] <- 4
eval1[[3]] <- eval3[[3]] <- 6
eval1[[4]] <- eval3[[4]] <- 8:12
eval1[[5]] <- eval3[[5]] <- 14:15
eval1[[6]] <- eval3[[6]] <- 17:ncol(datsummary1)
for (m in 1:6) {
  for (i in 1:(n+1)) {
    # ACCUMULATE SQUARED DEVIATIONS FROM THE TARGET ROW (ROW 1) ACROSS THE
    # COLUMNS BELONGING TO CHARACTERISTIC m
    v <- y <- 0
    for (j in eval1[[m]]) {
      w <- (datsummary1[i, j] - datsummary1[1, j])^2
      v <- v + w
      x <- (datsummary3[i, j] - datsummary3[1, j])^2
      y <- y + x
    }
    datsummary1[i, paste("dist", m, sep = "")] <- v
    datsummary3[i, paste("dist", m, sep = "")] <- y
  }
}
datsummary1$dist <- (datsummary1$dist1)/2 + (datsummary1$dist2) + (datsummary1$dist3) +
(datsummary1$dist4)/5 + (datsummary1$dist5)/2 + (datsummary1$dist6)/(length(eval1[[6]]))
# datsummary3$dist <- (datsummary3$dist1)/2 + (datsummary3$dist2) + (datsummary3$dist3) +
# (datsummary3$dist4)/5 + (datsummary3$dist5)/2 + (datsummary3$dist6)/(length(eval3[[6]]))
head(datsummary1[order(datsummary1$dist), ])
## Alabama Alaska Arizona Arkansas California Colorado
## 1 0.016340864 0.002293835 0.02740116 0.01195849 0.1455110 0.02183972
## 12 0.014516129 0.003225806 0.02419355 0.01774194 0.1467742 0.02580645
## 23 0.008064516 0.003225806 0.03225806 0.01129032 0.1612903 0.02258065
## 30 0.014516129 0.003225806 0.02419355 0.00483871 0.1403226 0.03548387
## 16 0.014516129 0.003225806 0.02580645 0.00483871 0.1387097 0.02580645
## 9 0.003225806 0.001612903 0.03064516 0.01290323 0.1741935 0.03548387
## Connecticut Florida Hawaii Idaho Illinois Indiana
## 1 0.011333173 0.08089536 0.005892304 0.005634082 0.04504671 0.02475471
## 12 0.029032258 0.08064516 0.001612903 0.009677419 0.05483871 0.02258065
## 23 0.017741935 0.05645161 0.003225806 0.004838710 0.04516129 0.03548387
## 30 0.014516129 0.08548387 0.004838710 0.009677419 0.05806452 0.02419355
## 16 0.006451613 0.08548387 0.003225806 0.009677419 0.05000000 0.02903226
## 9 0.009677419 0.05161290 0.004838710 0.003225806 0.05000000 0.01774194
## Iowa Kansas Louisiana Maine Massachusetts Michigan
## 1 0.01236783 0.009604085 0.017929276 0.004072198 0.02534486 0.03787343
## 12 0.01774194 0.012903226 0.009677419 0.014516129 0.02580645 0.03709677
## 23 0.01451613 0.014516129 0.008064516 0.003225806 0.02096774 0.04516129
## 30 0.01129032 0.011290323 0.011290323 0.004838710 0.02096774 0.03387097
## 16 0.01774194 0.004838710 0.011290323 0.004838710 0.02741935 0.04677419
## 9 0.01129032 0.011290323 0.009677419 0.009677419 0.04354839 0.03064516
## Minnesota Mississippi Missouri Montana Nebraska Nevada
## 1 0.02155056 0.012794459 0.02338686 0.003889198 0.006812504 0.011172643
## 12 0.02903226 0.004838710 0.02903226 0.003225806 0.003225806 0.003225806
## 23 0.02741935 0.004838710 0.02258065 0.003225806 0.006451613 0.011290323
## 30 0.01129032 0.004838710 0.01451613 0.004838710 0.004838710 0.006451613
## 16 0.03225806 0.004838710 0.02258065 0.004838710 0.004838710 0.014516129
## 9 0.02580645 0.006451613 0.02419355 0.003225806 0.003225806 0.001612903
## New Hampshire New Jersey New Mexico New York North Dakota Ohio
## 1 0.004848394 0.03127979 0.008125055 0.07274310 0.002821704 0.04356165
## 12 0.006451613 0.02741935 0.009677419 0.06129032 0.001612903 0.04838710
## 23 0.001612903 0.03387097 0.008064516 0.08225806 0.003225806 0.04838710
## 30 0.008064516 0.04838710 0.008064516 0.05000000 0.001612903 0.07258065
## 16 0.004838710 0.03064516 0.008064516 0.05161290 0.004838710 0.06290323
## 9 0.008064516 0.03709677 0.009677419 0.08225806 0.004838710 0.02741935
## Oklahoma Oregon Pennsylvania Rhode Island South Dakota Texas
## 1 0.014744107 0.01457041 0.04538289 0.003808226 0.003594947 0.10630498
## 12 0.009677419 0.02419355 0.05645161 0.001612903 0.001612903 0.05967742
## 23 0.009677419 0.02096774 0.05000000 0.006451613 0.006451613 0.07903226
## 30 0.009677419 0.02258065 0.04677419 0.001612903 0.004838710 0.09354839
## 16 0.014516129 0.01612903 0.05645161 0.001612903 0.004838710 0.07258065
## 9 0.011290323 0.02096774 0.06935484 0.004838710 0.001612903 0.07741935
## Utah Vermont Washington Wisconsin Wyoming completely rural
## 1 0.010698042 0.001952831 0.02666439 0.02133450 0.001865690 0.013532284
## 12 0.009677419 0.004838710 0.03548387 0.01774194 0.003225806 0.008064516
## 23 0.016129032 0.003225806 0.02419355 0.01774194 0.004838710 0.011290323
## 30 0.012903226 0.001612903 0.02580645 0.03064516 0.001612903 0.012903226
## 16 0.006451613 0.003225806 0.04032258 0.02580645 0.001612903 0.011290323
## 9 0.008064516 0.003225806 0.02258065 0.03225806 0.003225806 0.008064516
## mostly rural mostly urban 1 2 3 F M
## 1 0.10101192 0.8854558 0.4691001 0.3920358 0.1388641 0.5050030 0.4949970
## 12 0.11774194 0.8741935 0.4258065 0.4370968 0.1370968 0.5129032 0.4870968
## 23 0.08064516 0.9080645 0.4338710 0.4370968 0.1290323 0.5322581 0.4677419
## 30 0.08870968 0.8983871 0.4096774 0.4516129 0.1387097 0.5161290 0.4838710
## 16 0.08709677 0.9016129 0.4338710 0.4161290 0.1500000 0.5112903 0.4887097
## 9 0.07258065 0.9193548 0.4596774 0.4112903 0.1290323 0.5145161 0.4854839
## H NH AA BA IA PA
## 1 0.2010186 0.7989814 0.06428357 0.109595391 0.013414204 0.002541797
## 12 0.1693548 0.8306452 0.07419355 0.019354839 0.906451613 0.014516129
## 23 0.1854839 0.8145161 0.09032258 0.001612903 0.020967742 0.887096774
## 30 0.1693548 0.8306452 0.06935484 0.001612903 0.001612903 0.014516129
## 16 0.1790323 0.8209677 0.08225806 0.003225806 0.009677419 0.904838710
## 9 0.1951613 0.8048387 0.07741935 0.003225806 0.019354839 0.900000000
## TOM WA dist1 dist2 dist3 dist4
## 1 0.009780800 0.800384235 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## 12 0.003225806 0.024193548 1.115734e-05 3.344819e-05 1.573496e-05 1.353342e-04
## 23 0.008064516 0.003225806 2.445810e-05 4.464534e-07 5.489708e-07 7.203676e-04
## 30 0.912903226 0.014516129 1.115734e-05 5.069133e-05 1.861629e-04 2.082912e-04
## 16 0.014516129 0.003225806 3.411663e-06 5.069133e-05 1.573496e-05 8.734573e-05
## 9 0.003225806 0.001612903 1.098722e-05 8.925177e-07 1.861629e-04 9.380860e-04
## dist5 dist6 dist
## 1 0.000000e+00 0.000000 0.00000000
## 12 7.897746e-05 1.417820 0.03470230
## 23 1.214417e-04 1.437773 0.03528566
## 30 4.691910e-05 1.457166 0.03584819
## 16 6.678451e-05 1.467324 0.03590740
## 9 7.093653e-05 1.460388 0.03603485
# head(datsummary1[order(datsummary1$dist), ])
### NEW NON-REGION 3 TARGET DATA:
# INDEX OF THE BEST-MATCHING DRAW (ROW 1 OF datsummary1 IS THE TARGET ITSELF,
# SO IT IS EXCLUDED BEFORE TAKING THE MINIMUM)
x <- which.min(datsummary1$dist[-1])
# x <- which.min(datsummary3$dist[-1])
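With the index of the best-matching draw in hand, a minimal sketch of pulling that day’s list (bestsample is a hypothetical name; rows and samples are the objects created in the sampling loop above):

# SKETCH ONLY: rows[[x]] HOLDS THE IDS DRAWN IN THE BEST-MATCHING SAMPLE AND
# samples[[x]] THEIR DEMOGRAPHIC ROWS; EITHER CAN BE EXPORTED AS THE CALL LIST
bestsample <- samples[[x]]
head(bestsample)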