Welcome! The privLCM package provides an interface to creating
differentially-private contingency tables using a Bayesian Latent Class
approach. Samples can be drawn using the mcmc.sampler
function in the package.
This function works for any binary table of arbitrary dimension. However, computation times can be long if dimension and/or the number of latent classes is high. This is primarily due to the computation of the two-way marginal probabilities which requires enumerating over all possible combinations of the other variables.
We will implement our method using a \(2^5\) table with \(G=3\) latent classes. First, let’s simulate some data. Let N = 10,000.
set.seed(1234)
N=10000
P=5
##Simulating data
prob_var = runif(5)
##Creating the data frame
df = data.frame("Var1" = sample(1:2, N, replace = TRUE, prob = c(prob_var[1], 1-prob_var[1])),
"Var2" = sample(1:2, N, replace = TRUE, prob = c(prob_var[2], 1-prob_var[2])),
"Var3" = sample(1:2, N, replace = TRUE, prob = c(prob_var[3], 1-prob_var[3])),
"Var4" = sample(1:2, N, replace = TRUE, prob = c(prob_var[4], 1-prob_var[4])),
"Var5" = sample(1:2, N, replace = TRUE, prob = c(prob_var[5], 1-prob_var[5])))
head(df)
## Var1 Var2 Var3 Var4 Var5
## 1 2 1 1 2 2
## 2 2 1 1 2 1
## 3 2 1 1 2 1
## 4 2 1 2 2 1
## 5 2 1 1 2 1
## 6 2 2 1 1 1
Now, we need to calculate the two-way marginals. privLCM
also requires a matrix that encodes what each count represent. This
matrix should only have two non-NA
entries per row. The
other two entries denote which variables (as denoted by column position)
and values (as denoted by a 1
or 2
in each
row). The first row of the matrix tells the sampler which marginal the
first count represents and so on. Let’s calculate both pieces:
freq = c()
P = 5
comb.mat = matrix(ncol = P)
for(i in 1:(P-1)){
for(j in (i+1):P){
counts = plyr::count(df[,c(i,j)])
freq = c(freq, counts$freq)
tmp.mat = matrix(NA, nrow = 4, ncol = P)
tmp.mat[,i] = counts[,1]
tmp.mat[,j] = counts[,2]
comb.mat = rbind(comb.mat, tmp.mat)
}
}
comb.mat = comb.mat[-1,]
head(comb.mat)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 NA NA NA
## [2,] 1 2 NA NA NA
## [3,] 2 1 NA NA NA
## [4,] 2 2 NA NA NA
## [5,] 1 NA 1 NA NA
## [6,] 1 NA 2 NA NA
Now, we are ready to go! Other inputs that we need to specifiy are
P
which denotes the dimension of the table (in this case
5), G
which denotes the number of latent classes,
eps
denotes the desired epsilon value, and
samp.size
which denotes number of observations in the data.
We can also control the number of samples drawn with
nsamples
.
library(privLCM)
library(data.table)
library(Rfast)
## Loading required package: Rcpp
## Loading required package: RcppZiggurat
##
## Attaching package: 'Rfast'
## The following object is masked from 'package:data.table':
##
## transpose
library(parallel)
cl = makeCluster(detectCores()-1)
clusterExport(cl, varlist = c("twoWay.probs.vectorized", "row.prob.vec","getProbs", "data.table", "rowprods", "P"))
samps = mcmc.sampler(freq, comb.mat, eps = 1, P = P, G = 3, nsamples = 5000, samp.size = N, cl = cl, .PiTuningParam = 0.075, .PsiTuningParam =75)
## [1] 100
## [1] 200
## [1] 300
## [1] 400
## [1] 500
## [1] 600
## [1] 700
## [1] 800
## [1] 900
## [1] 1000
## [1] 1100
## [1] 1200
## [1] 1300
## [1] 1400
## [1] 1500
## [1] 1600
## [1] 1700
## [1] 1800
## [1] 1900
## [1] 2000
## [1] 2100
## [1] 2200
## [1] 2300
## [1] 2400
## [1] 2500
## [1] 2600
## [1] 2700
## [1] 2800
## [1] 2900
## [1] 3000
## [1] 3100
## [1] 3200
## [1] 3300
## [1] 3400
## [1] 3500
## [1] 3600
## [1] 3700
## [1] 3800
## [1] 3900
## [1] 4000
## [1] 4100
## [1] 4200
## [1] 4300
## [1] 4400
## [1] 4500
## [1] 4600
## [1] 4700
## [1] 4800
## [1] 4900
## [1] 5000
stopCluster(cl)
Now, a quick word about the results that we return. The output of
mcmc.sampler
is a list. This list contains many elements,
including samples from all of our parameters and acceptance rates. It
also contains marginal probabilities calculated for each estimate and
full probabilities (if desired)
names(samps)
## [1] "marg_probs" "full_probs" "M" "Psi"
## [5] "Pi" "accept_rate_pi" "accept_rate_psi" "accept_rate_M"
## [9] "epsilon"
Now, let’s look at a few graphs. First, lets look at trace plots for some of the marginal probabilities.
plot(samps$marg_probs[-c(1:1000),1], ylab = "1st Marginal Probability", xlab = "Iterate")
abline(h = freq[1]/N, col = "red", lty = "dashed")
plot(samps$marg_probs[-c(1:1000),2], ylab = "2nd Marginal Probability", xlab = "Iterate")
abline(h = freq[2]/N, col = "red", lty = "dashed")